-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] Avoid hdfs fs manager interrupting the thread when exception occurs #48403
[BugFix] Avoid hdfs fs manager interrupting the thread when exception occurs #48403
Conversation
fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
Outdated
Show resolved
Hide resolved
@@ -708,6 +710,7 @@ private HdfsFs getFileSystemByCloudConfiguration(CloudConfiguration cloudConfigu | |||
} | |||
return fileSystem; | |||
} catch (Exception e) { | |||
Thread.interrupted(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we try to unify the exception handler for all FileSystems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
later
@@ -518,6 +518,7 @@ public HdfsFs getDistributedFileSystem(String scheme, String path, Map<String, S | |||
} | |||
return fileSystem; | |||
} catch (Exception e) { | |||
Thread.interrupted(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the right way to handle the interuption event? Do we need to do something like retry? Always clear the interupted state even though there's no interuption happened? If we really need to eat the interupte exception, we should catch this type of exception explicitly and comment the reason i think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a exception is caught, the interrupted flag set by the exception should usually be reset to not affect the caller thread, otherwise it is meaningless to catch this exception, just rethrow it and let the caller to handle it will be better.
the retry is done in the underlying code, no need to retry here.
if let the interrupted flag set, any subsequent blocking code in this thread will throw exception, like sleep
, wait
, read
, write
..., most of these blocking calls have been encapsulated, it's impossible to catch all the exceptions and retry in all blocking code,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still doubt if it is really caused by the thread interruption. This sounds to me like if a thread is interrupted for socket timeout, it won't be re-used again for further IO operation unless the interruption is cleared manually. Didn't hear about this when doing java IO programming.
A second thought is, even it is true, we'd better move the remote IO operation to a separate thread pool or thread while replying the image, so no matter what happens to the thread, it can be destroyed soon and won't affect the main thread at all.
Have a second look into the source code of JDK about these We shall pursue a different fix other than just reset the interruption flag. |
The interrupted flag should just be handled in the related blocking operations, should have no influence with unrelated blocking operations, I haven't figured out why it leaked. |
The uncleaned interrupted flag is always a hidden danger, and any thread that calls the hdfs fs manager may be affected. |
some long long discussion about this AbstractInterruptibleChannel behavior in this link: https://news.ycombinator.com/item?id=35125962 This issue should be fixed, but maybe in a different way. Concerning on the unknown impacts of clearing of the thread interruption. |
any suggestion about how to fix the issue ? |
still search and study, try to get a fully understand of the AbstractInterruptibleChannel thing. |
d74231f
to
04f369d
Compare
Signed-off-by: xiangguangyxg <[email protected]>
04f369d
to
424efcc
Compare
Quality Gate passedIssues Measures |
[FE Incremental Coverage Report]❌ fail : 0 / 52 (00.00%) file detail
|
[BE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
@Mergifyio backport branch-2.5 |
✅ Backports have been created
|
✅ Backports have been created
|
✅ Backports have been created
|
✅ Backports have been created
|
✅ Backports have been created
|
… occurs (#48403) Signed-off-by: xiangguangyxg <[email protected]> (cherry picked from commit 3fcafac)
… occurs (#48403) Signed-off-by: xiangguangyxg <[email protected]> (cherry picked from commit 3fcafac)
… occurs (#48403) Signed-off-by: xiangguangyxg <[email protected]> (cherry picked from commit 3fcafac) # Conflicts: # fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
… occurs (#48403) Signed-off-by: xiangguangyxg <[email protected]> (cherry picked from commit 3fcafac) # Conflicts: # fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
… occurs (#48403) Signed-off-by: xiangguangyxg <[email protected]> (cherry picked from commit 3fcafac) # Conflicts: # fe/fe-core/src/main/java/com/starrocks/fs/hdfs/HdfsFsManager.java
… occurs (backport #48403) (#48695) Co-authored-by: xiangguangyxg <[email protected]>
… occurs (backport #48403) (#48696) Co-authored-by: xiangguangyxg <[email protected]>
… occurs (backport #48403) (#48697) Signed-off-by: xiangguangyxg <[email protected]> Co-authored-by: xiangguangyxg <[email protected]>
… occurs (backport #48403) (#48698) Signed-off-by: xiangguangyxg <[email protected]> Co-authored-by: xiangguangyxg <[email protected]>
… occurs (backport #48403) (#48699) Signed-off-by: xiangguangyxg <[email protected]> Co-authored-by: xiangguangyxg <[email protected]>
… occurs (StarRocks#48403) Signed-off-by: xiangguangyxg <[email protected]>
Why I'm doing:
When exception occurs in hdfs fs manager, the current thread will be interrupted, causing
InterruptedException
orClosedByInterruptException
in subsequent code execution.This may cause the following problem:
User backup data to a HDFS repository, and then releases the HDFS cluster. When FE starts, it will access the released HDFS cluster. If the access fails, the current thread will be interrupted, an
InterruptedIOException
orClosedByInterruptException
is thrown and the startup fails.What I'm doing:
Avoid hdfs fs manager interrupting the thread by clearing the interrupted flag when exception occurs.
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: